ABSTRACT

This paper presents our contribution to the third ’CHiME’
speech separation and recognition challenge, including both
front-end signal processing and back-end speech recognition.
In the front-end, a multi-channel Wiener filter (MWF) is
designed to achieve background noise reduction. Unlike the
traditional MWF, the parameter for the tradeoff between noise
reduction and target signal distortion is optimized according
to the desired noise reduction level. In the back-end, several
techniques are used to improve noisy Automatic Speech
Recognition (ASR) performance, including Deep Neural Network
(DNN), Convolutional Neural Network (CNN) and Long Short-Term
Memory (LSTM) acoustic models using a medium vocabulary,
lattice rescoring with a big-vocabulary language model finite
state transducer, and the ROVER scheme. Experimental results
show that the proposed system combining front-end and back-end
is effective at improving ASR performance.

Index Terms — CHiME challenge, Multi-channel Wiener 
filter, Deep Neural Network, Noise Robust, Automatic Speech
Recognition 

1. INTRODUCTION 

Automatic Speech Recognition (ASR) has been applied to
many human-computer interaction systems, such as tablet
computers, smartphones, personal computers and televisions.
Meanwhile, robust ASR in noisy environments is receiving more
attention because of its practical value. The 3rd ’CHiME’
speech separation and recognition challenge is a platform for
testing the recognition rate of noisy speech in complex
environments. Our contributions to CHiME are separated into
two parts: front-end techniques and back-end techniques.

It is well known that many front-end techniques aim at
extracting clean desired speech signals. Among them,
multi-channel systems have proved effective at improving
front-end performance in noisy and reverberant environments,
and they attract growing attention because they offer a better
balance between noise reduction and speech distortion. More
noise reduction does not necessarily mean cleaner desired
speech: speech distortion introduced by processing artifacts
severely affects ASR performance. Therefore, by taking speech
distortion into account in the multi-channel optimization
criterion, the multi-channel Wiener filter (MWF) technique has
been proposed to estimate the desired speech component in
noisy environments. The technique is generalized as the speech
distortion weighted MWF (SDW-MWF), which takes the tradeoff
between noise reduction and speech distortion into
consideration. In principle, it is desirable to have less
noise reduction in speech-dominant segments and more noise
reduction otherwise. With this motivation, we improve the
SDW-MWF by optimizing the tradeoff parameter from the
perspective of desired noise reduction control.

Recently, acoustic modelling based on Deep Neural Networks
(DNNs) has gained popularity, with consistent improvements in
recognition performance over earlier neural network based
front-ends. DNNs are either deployed as the front-end for
standard Hidden Markov Models based on Gaussian Mixture Models
(HMM-GMMs), or in a hybrid form to directly estimate
state-level posteriors. As noted in several publications, DNNs
show general word error rate (WER) improvements on the order
of 10-30% relative across a variety of small and large
vocabulary tasks when compared with HMM-GMMs built on classic
features. A DNN is a conventional Multi-Layer Perceptron (MLP)
with many internal or hidden layers. Convolutional Neural
Networks (CNNs) are an alternative type of neural network that
can be used to reduce spectral variations and model the
spectral correlations that exist in signals. CNNs have been
reported to be a more effective model for speech than DNNs. In
addition, Long Short-Term Memory (LSTM) is a specific
recurrent neural network (RNN) architecture designed to model
temporal sequences and their long-range dependencies more
accurately than conventional RNNs. LSTMs have also been shown
to be more effective than DNNs and conventional RNNs for
acoustic modeling. In this paper, we take advantage of these
techniques for acoustic modeling and combine them to achieve
better ASR performance.

This paper is organized as follows. Sections 2 and 3 describe
the front-end and back-end of the proposed system. Section 4
presents the ASR experiments and lists the results with
analysis. Finally, Section 5 draws conclusions.

2. SPEECH ENHANCEMENT FRONT-END

To suppress background noise, a multichannel Wiener filter
(MWF) is applied to the multi-microphone set-up. Since the MWF
does not require the transfer functions between the target
speaker and the microphones, it is suitable for the CHiME3
task. By taking speech distortion into account in its
optimization criterion, the MWF is generalized as the speech
distortion weighted multichannel Wiener filter (SDW-MWF),
which provides a tradeoff between speech distortion and noise
reduction. In this work, a method based on the SDW-MWF with an
optimized tradeoff parameter is used.

Consider an array of $M$ microphones. Let $Y_m(k,l)$,
$m = 1, \ldots, M$, denote the short-time Fourier transform
(STFT) domain representation of the $m$-th microphone signal
at frequency index $k$ and frame index $l$. The received
signals are given as

$$Y_m(k,l) = S(k,l)\,G_m(k,l) + N_m(k,l) = X_m(k,l) + N_m(k,l) \tag{1}$$

where $S(k,l)$, $G_m(k,l)$, $X_m(k,l)$ and $N_m(k,l)$ are
respectively the STFT domain expressions of the source signal
$s(t)$, the transfer function from the source to the $m$-th
microphone $g_m(t)$, the target signal $x_m(t)$ and the noise
signal $n_m(t)$ at microphone $m$.

To find an optimal estimate of the target signal, the
designed SDW-MWF criterion is

$$\mathbf{w}_{\text{SDW-MWF}} = \arg\min_{\mathbf{w}} E\{|\mathbf{w}^H \mathbf{x} - X_1|^2\} + \mu E\{|\mathbf{w}^H \mathbf{n}|^2\} \tag{2}$$

where $X_1$ is the target signal at the first microphone,
$\mathbf{y}(k,l)$ is the received signal vector defined as
$\mathbf{y}(k,l) = [Y_1(k,l), \ldots, Y_M(k,l)]^T$, and
$\mathbf{x}(k,l)$, $\mathbf{n}(k,l)$ and $\mathbf{g}(k,l)$ are
defined similarly; $\mathbf{w}(k,l)$ represents the linear
filter given by $\mathbf{w}(k,l) = [W_1(k,l), \ldots, W_M(k,l)]^T$.
Here the operators $(\cdot)^T$ and $(\cdot)^H$ denote
transposition and Hermitian transposition, respectively. A
larger value of $\mu$ places more emphasis on noise reduction.
The variables $k$ and $l$ are omitted here for simplicity. The
solution to the SDW-MWF criterion can be obtained as

$$\mathbf{w}_{\text{SDW-MWF}} = [\Phi_{xx} + \mu \Phi_{nn}]^{-1} \Phi_{xx} \mathbf{u}_1 \tag{3}$$

where $\mathbf{u}_1 = [1, 0, \ldots, 0]^T$ is an
$M$-dimensional vector that selects the first microphone
(channel 1 of the 6-microphone array), and $\Phi_{xx}$ and
$\Phi_{nn}$ are the correlation matrices of the clean speech
signal and the noise signal, respectively.
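
The closed-form solution in (3) can be illustrated with a short NumPy sketch for a single frequency bin. This is only a minimal illustration under stated assumptions, not the implementation used in the system: the function names `sdw_mwf_filter` and `residual_noise`, the rank-1 speech model and all numeric values are hypothetical.

```python
import numpy as np

def sdw_mwf_filter(phi_xx, phi_nn, mu, ref=0):
    """Return w = (Phi_xx + mu * Phi_nn)^{-1} Phi_xx u_ref for one bin."""
    u = np.zeros(phi_xx.shape[0], dtype=complex)
    u[ref] = 1.0  # selects the reference (first) microphone
    # Solve the linear system rather than forming the inverse explicitly.
    return np.linalg.solve(phi_xx + mu * phi_nn, phi_xx @ u)

def residual_noise(w, phi_nn):
    """Noise power left at the filter output, w^H Phi_nn w."""
    return np.real(w.conj() @ phi_nn @ w)

# Toy statistics (hypothetical): rank-1 speech covariance, white noise.
rng = np.random.default_rng(0)
M = 6
g = rng.standard_normal(M) + 1j * rng.standard_normal(M)
phi_xx = 2.0 * np.outer(g, g.conj())
phi_nn = 0.5 * np.eye(M)

w_small = sdw_mwf_filter(phi_xx, phi_nn, mu=0.1)
w_large = sdw_mwf_filter(phi_xx, phi_nn, mu=10.0)

# Applying the filter to one multichannel STFT frame y gives w^H y.
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)
x1_hat = np.vdot(w_large, y)  # np.vdot conjugates its first argument
```

In this toy setting, the residual noise power for the larger tradeoff value is lower than for the smaller one, in line with the remark that a larger $\mu$ emphasizes noise reduction.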

With a fixed parameter $\mu$, a lower residual noise level is
generally achieved at the expense of increased speech
distortion. In our work, we compute the parameter according to
the desired noise reduction level:

$$\mu = \min(s,\; s/\mathrm{SNR}_1) \tag{4}$$

where $\mathrm{SNR}_1$ denotes the input signal-to-noise ratio
(SNR) of the first microphone, $s$ is a noise reduction
control factor defined as $s = \phi_{n_1 n_1}/\phi_0$,
$\phi_{n_1 n_1}$ represents the noise power at the first
microphone, and $\phi_0$ represents the desired residual noise
level. When the background noise level is relatively high or
the input SNR is relatively low, the optimized parameter
places more emphasis on noise reduction, which is reasonable.
In this work, the noise power and noise covariance matrix for
each frequency bin are computed from the initial and final 10
frames of each utterance.
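
The parameter selection and the noise statistics described above can be sketched as follows. This is a hedged illustration rather than the system's code: the function names, the array layout `(M, K, L)` and the example values are assumptions, and the SNR is assumed to be on a linear scale (the text does not state how the input SNR is estimated).

```python
import numpy as np

def noise_stats(Y, n_edge=10):
    """Estimate per-bin noise statistics from the first and last n_edge frames.

    Y is a complex STFT array of shape (M, K, L): microphones, bins, frames.
    Returns the noise covariance matrices, shape (K, M, M), and the noise
    power at the first microphone, shape (K,).
    """
    noise = np.concatenate([Y[:, :, :n_edge], Y[:, :, -n_edge:]], axis=2)
    # phi_nn[k] is the frame average of n n^H in frequency bin k.
    phi_nn = np.einsum("mkl,nkl->kmn", noise, noise.conj()) / noise.shape[2]
    phi_n1n1 = phi_nn[:, 0, 0].real
    return phi_nn, phi_n1n1

def tradeoff_parameter(phi_n1n1, phi_0, snr_1):
    """mu = min(s, s / SNR_1) with s = phi_n1n1 / phi_0 (linear-scale SNR)."""
    s = phi_n1n1 / phi_0
    return np.minimum(s, s / snr_1)

print(tradeoff_parameter(4.0, 1.0, 0.5))  # low input SNR: mu = s = 4.0
print(tradeoff_parameter(4.0, 1.0, 8.0))  # high input SNR: mu = s / SNR = 0.5
```

With these hypothetical numbers, a low input SNR keeps the parameter at $s$, while a high input SNR reduces it, so speech-dominant conditions receive less aggressive noise reduction.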

3. BACK-END DESCRIPTION 

3.1. Acoustic modeling with neural network

Fig. 1 shows the back-end of the proposed system, including
the techniques we used.

The GMM baseline includes the standard triphone-based
acoustic models with various feature transformations,
including linear discriminant analysis (LDA), maximum
likelihood linear transformation (MLLT), and feature space
maximum likelihood linear regression (fMLLR) with speaker
adaptive training (SAT).

The DNN baseline provides state-of-the-art ASR performance.
It is based on the Kaldi recipe for Track 2 of the 2nd CHiME
Challenge. The DNN is trained using the standard procedure
(pre-training using restricted Boltzmann machines,
cross-entropy training, and sequence discriminative training).
This baseline requires relatively large computational
resources (GPUs for the DNN training and many CPUs for lattice
generation).

We start DNN training from the scripts of the baseline
system. We use 7 hidden layers with 2048 nodes in each hidden
layer. The features for DNN training are 40-dimensional
filter-bank features with their delta and delta-delta
features. A context window of 11 frames (5+1+5) is used, so
the dimension of the DNN input layer is 40*3*11. Cepstral Mean
and Variance Normalization (CMVN) is applied and proves to be
useful. The DNN output layer has the same size as that of the
GMM-HMM system, which is 2024. The DNN is trained using the
same standard procedure as the baseline system.
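
The input dimension 40*3*11 = 1320 can be checked with a small NumPy sketch of the feature pipeline. It is a simplified stand-in, not the Kaldi recipe: the helper names are hypothetical, the deltas are approximated by finite differences with `np.gradient` (Kaldi uses a regression window), CMVN is applied per utterance, and edge frames are repeated when splicing.

```python
import numpy as np

def cmvn(feats):
    """Per-utterance mean and variance normalization of a (T, D) matrix."""
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

def splice(feats, left=5, right=5):
    """Stack each frame with its left/right context; edge frames are repeated."""
    T = feats.shape[0]
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i:i + T] for i in range(left + 1 + right)], axis=1
    )

T = 100  # hypothetical number of frames
fbank = np.random.default_rng(0).standard_normal((T, 40))

norm = cmvn(fbank)
delta = np.gradient(norm, axis=0)         # finite-difference stand-in
delta_delta = np.gradient(delta, axis=0)
feats = np.hstack([norm, delta, delta_delta])  # (T, 120)

spliced = splice(feats)                   # (T, 1320) = 40 * 3 * 11
print(spliced.shape)
```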

The CNN uses fbank+pitch features and contains two
convolutional hidden layers and a max-pooling layer. The input
feature vector (not including pitch) is divided into 40 bands.
Each band holds the corresponding dimension of 11 consecutive
feature frames, together with their derivatives, so the input
dimension of the CNN is 43*3*11. The first set of
convolutional filters is applied to 8 consecutive bands and
generates 128 feature maps. We then apply max-pooling across 3
bands to generate 11 bands. The second set of convolutional
filters is applied to 4 consecutive bands and generates 256
feature maps. Four fully-connected hidden layers of 1024 nodes
are arranged after the convolutional layers. The total number
of parameters for the CNN is 7.7M.
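
The band counts quoted above follow from simple arithmetic, sketched below. The sketch assumes convolution along frequency without padding and with stride 1, and non-overlapping max-pooling; these settings and the helper names are assumptions chosen because they reproduce the 11 bands stated in the text.

```python
def conv_out(n_in, width, stride=1):
    """Number of output bands of an unpadded convolution along frequency."""
    return (n_in - width) // stride + 1

def pool_out(n_in, size):
    """Number of output bands of non-overlapping max-pooling."""
    return n_in // size

bands = 40
after_conv1 = conv_out(bands, 8)       # 33 bands, 128 feature maps each
after_pool = pool_out(after_conv1, 3)  # 11 bands, as stated in the text
after_conv2 = conv_out(after_pool, 4)  # 8 bands, 256 feature maps each

print(after_conv1, after_pool, after_conv2)  # 33 11 8
```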


fi = min(s, s/SNRi) 



Fig. 1. Back-end description


The LSTM network used in this paper is a two-layer LSTM RNN,
where each LSTM layer has 1024 memory cells and a
dimensionality-reducing recurrent projection layer of 200
linear units.

In our experiments, we use the official trigram language
model (LM) in the initial decoding pass and a 5-gram LM for
lattice rescoring in a second pass. The official trigram LM
has a 5k-word vocabulary. The 5-gram LM is trained using the
official training data only, but has a vocabulary of up to 12k
words.

3.2. Combination of different systems 

To combine the multiple speech recognition outputs into a
single one, we employ ROVER at the decision level in the final
step. The fusion enables us to achieve a lower error rate than
any of the individual systems alone. In this paper, the NIST
scoring toolkit (SCTK, version 1.3) is used as the ROVER tool
to combine the different results. It takes N input files and
performs an N-way dynamic programming (DP) alignment on those
files. The output is a voted result determined by the maximum
confidence score.
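
The voting step can be illustrated with a simplified sketch. It is not the SCTK implementation and skips the DP alignment entirely: the hypotheses are assumed to be already aligned into slots, an empty string marks a null arc, and the function names, the weight `alpha` and the example confidences are hypothetical. Each candidate word is scored by a weighted sum of its relative frequency and its maximum confidence.

```python
from collections import defaultdict

def vote_slot(candidates, alpha=0.5):
    """Pick one word from a slot of (word, confidence) pairs, one per system."""
    n_sys = len(candidates)
    count = defaultdict(int)
    max_conf = defaultdict(float)
    for word, conf in candidates:
        count[word] += 1
        max_conf[word] = max(max_conf[word], conf)
    # Score = alpha * relative frequency + (1 - alpha) * maximum confidence.
    return max(
        count,
        key=lambda w: alpha * count[w] / n_sys + (1 - alpha) * max_conf[w],
    )

def rover_vote(aligned_slots, alpha=0.5):
    """Vote slot by slot and drop null arcs from the final word sequence."""
    words = [vote_slot(slot, alpha) for slot in aligned_slots]
    return [w for w in words if w]

# Three hypothetical systems, already aligned into three slots.
slots = [
    [("the", 0.9), ("the", 0.8), ("a", 0.4)],
    [("cat", 0.5), ("cap", 0.6), ("cat", 0.7)],
    [("", 0.3), ("sat", 0.9), ("sat", 0.6)],
]
print(rover_vote(slots))  # ['the', 'cat', 'sat']
```

Setting `alpha=0` in this sketch reduces the rule to choosing the word with the highest confidence in each slot.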

4. EXPERIMENTS AND RESULTS 

The experiments are all carried out following the
instructions of the CHiME challenge. In this section, we
report the ASR improvement step by step for each technique we
used, leading to the final WER on the test set provided by the
CHiME challenge. Table 1 gives the GMM and DNN baselines
provided by CHiME. Table 2 shows the ASR results of the
proposed system, and Table 3 shows the ASR results of the best
system after ROVER under each scenario: bus (BUS), cafe (CAF),
pedestrian area (PED), and street junction (STR).

4.1. ASR performance of front-end speech enhancement 

As mentioned above, front-end speech enhancement benefits ASR
performance. Tables 1 and 2 show that the WER on the real test
data decreases from 37.36% to 23.19% when the speech
enhancement method is changed from MVDR (supplied by the CHiME
organizers) to the proposed SDW-MWF under the GMM acoustic
model. If we randomize the SNR of the training data between
-6 dB and 6 dB (denoted by Random SNR in Table 2), instead of
using the SNR estimated from the real recorded data to
simulate the training set, the WER decreases to 22.07%.

Under the DNN+sMBR acoustic model, the WER on the test data
decreases from 33.76% to 18.4% using the SDW-MWF and random
SNR schemes. It is worth mentioning that all the training data
is enhanced to compensate for the mismatch between the
training data and test data.

4.2. Back-end ASR performance 

The results of the DNN model on the development and evaluation
sets are also given in Table 2. We can see that the DNN achieves a 16.63%


































Model     Test Data  Training Data  Dev. Set WER (%)  Test Set WER (%)
                                    Real     Sim.     Real     Sim.
GMM       noisy      clean          55.65    50.25    79.84    63.30
GMM       noisy      noisy          18.70    18.71    33.23    21.59
GMM       MVDR       clean          41.88    21.72    78.12    25.63
GMM       MVDR       MVDR           20.55    9.79     37.36    10.59
DNN+sMBR  noisy      noisy          16.13    14.30    33.43    21.51
DNN+sMBR  MVDR       MVDR           17.72    8.17     33.76    11.19

Table 1. WER baselines from the 3rd CHiME challenge.


Model             Test Data  Training Data        Dev. Set WER (%)  Test Set WER (%)
                                                  Real     Sim.     Real     Sim.
GMM               SDW-MWF    Clean                30.23    29.75    53.43    41.58
GMM               SDW-MWF    SDW-MWF              13.16    14.11    23.19    18.65
GMM               SDW-MWF    Random SNR+SDW-MWF   13.01    13.95    22.07    17.57
GMM+Rescore                                       11.61    12.37    20.35    15.7
DNN+sMBR          SDW-MWF    Random SNR+SDW-MWF   9.95     10.03    18.4     12.98
DNN+sMBR+Rescore                                  8.48     9.01     15.3     11.29
CNN+sMBR          SDW-MWF    Random SNR+SDW-MWF   9.52     9.64     17.87    12.64
CNN+sMBR+Rescore                                  8.51     8.77     16.37    11.55
LSTM              SDW-MWF    Random SNR+SDW-MWF   10.81    11.18    18.96    14.1
LSTM+Rescore                                      9.44     9.71     16.45    12.48
ROVER             SDW-MWF    Random SNR+SDW-MWF   7.29     7.68     13.2     9.71

Table 2. WERs of proposed system.


relative WER reduction compared with the GMM system on the
real data of the test set. Since this improvement is not
sufficient, we also tried several other NN topologies.

As shown in Table 2, the CNN acoustic model outperforms the
conventional DNN: the WER decreases from 18.4% to 17.87%.
Table 2 also shows that the LSTM improves on the GMM, with a
14.09% relative WER reduction. After lattice rescoring, all of
the systems improve significantly.
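
The relative reductions quoted in this section can be reproduced from the real test-set WERs in Table 2. The short sketch below assumes the usual definition, (baseline - new) / baseline; the function name is hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Real test-set WERs (%) from Table 2, Random SNR+SDW-MWF training.
gmm, dnn, lstm = 22.07, 18.40, 18.96

print(round(relative_reduction(gmm, dnn), 2))   # 16.63
print(round(relative_reduction(gmm, lstm), 2))  # 14.09
```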

Finally, the best ASR result is obtained by combining all the
systems with lattice rescoring. We achieve a final WER of
13.2% on the real data of the test set, a 60.9% relative
reduction in WER compared to the result of 33.23% from the
best GMM baseline. Table 3 shows the detailed ASR results
under the different recording scenarios.

The best single system is the DNN+sMBR with lattice
rescoring, as shown in Table 2.


Environment  Dev. Set WER (%)  Test Set WER (%)
             Real     Sim.     Real     Sim.
BUS          8.88     6.77     17.74    7.4
CAF          7.08     9.94     11.75    10.95
PED          5.78     6.14     13.34    9.19
STR          7.4      7.89     9.96     11.32

Table 3. WERs of the best system under different environments.


5. CONCLUSION 

A state-of-the-art ASR system is presented in this paper for
the task of reducing the effects of noise in different
real-world scenarios using a 6-microphone array. Its two parts
are described separately. Front-end speech enhancement using
SDW-MWF achieves a considerable performance improvement.
Back-end techniques including GMM, DNN, CNN and LSTM are
investigated. The combination of the four systems with lattice
rescoring gives the best ASR performance on the development
and test sets. We achieve a relative WER reduction of 60.9% on
the real data of the test set compared to the best baseline
system.